
dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step (issue #55) #62

Open

dusterbloom wants to merge 3 commits into Luce-Org:main from dusterbloom:fix/issue-55-stable-kv-pad

Conversation

@dusterbloom
Contributor

Fixes #55.

Root cause

Every spec-decode iteration calls build_target_step_tree (target verify, ~3127 ggml graph nodes) at dflash/test/test_dflash.cpp:1703 and build_draft_step (draft forward, ~186 nodes) at dflash/test/test_dflash.cpp:1556 on the same StepGraph sg, sharing one ggml_gallocr. ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes, so every call sees a mismatch left over from the previous call's opposite topology — forcing ggml_gallocr_reserve to re-walk the entire graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer (GPU driver cost).
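In ggml_gallocr terms, the per-step pattern is roughly the following (a sketch only: the builders take more arguments than shown, the galloc member name is illustrative, and gf_target / gf_draft stand for the graphs produced by build_target_step_tree and build_draft_step):

  // Pre-fix layout: one allocator shared by both builders.
  ggml_gallocr_t galloc = sg.galloc;

  // Every spec-decode step alternates two very different topologies on it:
  ggml_gallocr_alloc_graph(galloc, gf_target); // target verify, ~3127 nodes: 186 -> 3127, realloc
  ggml_gallocr_alloc_graph(galloc, gf_draft);  // draft forward,  ~186 nodes: 3127 -> 186, realloc again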

The reporter, on Windows / RTX 4090 / CUDA 13, sees "ggml_gallocr_needs_realloc: graph has different number of nodes" log spam every step (the message is gated behind #ifndef NDEBUG, so Linux Release builds are silent but pay the same cost). Decode throughput halves from 90 tok/s at 16k context to 55 tok/s at 32k context.

Fix

Split the shared StepGraph sg into target_sg and draft_sg, each with its own ggml_gallocr. Target verify settles into the 3127-node topology, the draft into the 186-node one, and neither bounces.

 dflash/test/test_dflash.cpp | 42 ++++++++++++++++++++++++++++--------------

The diff is minimized via a StepGraph & sg = target_sg; alias so the existing prefill/target-verify call sites are unchanged; only the draft block (~10 references) swaps sg.X for draft_sg.X. Daemon-mode reset and the two migrate-cache sites destroy both StepGraphs.
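In outline, the split looks like this (again a sketch: the real StepGraph in test_dflash.cpp carries more state, and galloc here just stands for the StepGraph's own allocator):

  StepGraph target_sg;           // its gallocr settles into the ~3127-node verify topology
  StepGraph draft_sg;            // its gallocr settles into the ~186-node draft topology
  StepGraph & sg = target_sg;    // existing prefill / target-verify call sites keep using sg

  // Per spec-decode step, each allocator now only ever sees one topology:
  ggml_gallocr_alloc_graph(sg.galloc,       gf_target); // always ~3127 nodes
  ggml_gallocr_alloc_graph(draft_sg.galloc, gf_draft);  // always  ~186 nodes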

Verification

Patched ggml_gallocr_alloc_graph to unconditionally fprintf to stderr at each "needs_realloc returns true" site (removing the #ifndef NDEBUG gate). Ran test_dflash on a tokenized HE prompt + n_gen=256 + --ddtree-budget=22 + --max-ctx=2048 + --fast-rollback. Same prompt, same flags, before vs after this commit:
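The instrumentation was roughly the following, applied at the node-count check and at the other return-true sites in the same way (a sketch against the upstream ggml-alloc.c layout; exact code differs between ggml versions):

  static bool ggml_gallocr_needs_realloc(ggml_gallocr_t galloc, struct ggml_cgraph * graph) {
      if (galloc->n_nodes != graph->n_nodes) {
          // upstream gates this behind #ifndef NDEBUG and uses the debug logger;
          // for this measurement it is an unconditional fprintf instead
          fprintf(stderr, "%s: graph has different number of nodes (%d -> %d)\n",
                  __func__, galloc->n_nodes, graph->n_nodes);
          return true;
      }
      // ... per-node offset/size checks, instrumented the same way ...
      return false;
  }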

                                             Before   After
  needs_realloc events over 26 steps           56     3 (initial only)
  cudaFree+cudaMalloc events during decode     14     0

Reasons breakdown before fix:

  • 26: n_nodes 186 → 3127
  • 25: n_nodes 3127 → 186 (the alternation)
  • 3: per-tensor size grew (monotonic kv_pad growth)
  • 2: initial reserves

After fix: just the 3 initial reserves (one per gallocr, each fired exactly once at first use).

Bench (RTX 3090 / Linux / CUDA 12.6)

bench_he.py --n-gen 128 --ddtree-budget 22, 3-run mean:

  • main: 86.72 tok/s (runs: 85.38, 88.05, 86.72)
  • this fix: 84.99 tok/s (runs: 84.98, 81.71, 88.30)

Within bench-noise. This is consistent with the hypothesis that on Linux/CUDA 12.6 the per-step cudaFree+cudaMalloc cost is small (driver fast-paths the alloc), so eliminating it doesn't show up as decode tok/s. The reporter's Windows/CUDA 13 stack has a slower stream-allocator where the saved cost should translate into measurable tok/s recovery — needs verification on their box.

Test plan

  • Build clean (Release).
  • bench_he.py parity (within noise).
  • Bit-exact correctness preserved across the fix (same step count, same accept length, same final committed token).
  • Instrumentation confirms the realloc-per-step pattern is gone.
  • Reporter (@gtrak) verifies on Windows / RTX 4090 / CUDA 13 — does this recover decode tok/s at 16k+ context?

What this does NOT fix

  • Residual needs_realloc events from monotonic kv_pad growth at long context — these are rare boundary crossings (every ~256 tokens), not per-step; see the sketch after this list. Codex review flagged these as low-severity; not chasing unless the reporter still sees churn.
  • Codex review also flagged the StepGraph & sg = target_sg; alias as a low-severity readability footgun — a follow-up s/sg/target_sg/g would clarify. Held to keep this PR minimal.
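Sketch for the residual kv_pad point above (the 256-token granularity is inferred from the observed cadence, not read from the code; the real padding rule lives in test_dflash.cpp):

  // If the padded KV length only grows in 256-token steps, activation tensor sizes
  // change only when a boundary is crossed, so needs_realloc fires roughly once per
  // ~256 generated tokens at long context instead of once per spec-decode step.
  int64_t kv_pad = ((n_kv + 255) / 256) * 256;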

dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step

Issue Luce-Org#55: every spec-decode iteration calls build_target_step_tree
(target verify, ~3127 graph nodes) and build_draft_step (draft forward,
~186 graph nodes) on the SAME StepGraph, sharing one ggml_gallocr.
ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes,
so every call sees a mismatch left over from the previous call's
opposite topology, forcing ggml_gallocr_reserve to re-walk the entire
graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer
(GPU driver cost).  Reporter on Windows/RTX 4090 sees the
"graph has different number of nodes" debug log fire every step and
decode tok/s halving from 90 @ 16k context to 55 @ 32k context.

Fix: introduce target_sg and draft_sg, each with its own ggml_gallocr.
Target verify settles into the 3127-node graph topology, draft into
the 186-node topology, and neither bounces.  Existing prefill /
target-verify call sites keep their `sg` references via a
StepGraph & sg = target_sg alias; only the draft block (~10 calls)
swaps `sg.X` for `draft_sg.X`.  Daemon-mode reset and migrate-cache
sites destroy both StepGraphs.

Verified with one-line instrumentation patch on ggml_gallocr_alloc_graph
(unconditionally fprintf to stderr at each "needs_realloc returns
true" site, removing the #ifndef NDEBUG gate the upstream messages
are silenced by in Release builds).  HE prompt 00 + ddtree-budget=22 +
n_gen=256 over 26 spec-decode steps:

  Before: 56 needs_realloc events (alternating n_nodes 186 ↔ 3127),
          14 cudaFree+cudaMalloc events.
  After:  3 needs_realloc events (initial only: 0 -> 3127, 0 -> 3079,
          0 -> 186), 0 cudaFree+cudaMalloc events during decode.

bench_he.py (RTX 3090, --n-gen 128, --ddtree-budget 22, 3-run mean):
  main:     86.72 tok/s
  this fix: 84.99 tok/s
Within bench-noise on Linux/CUDA 12.6 because cudaMalloc is cheap on
this stack — the saved per-step cost is small.  The reporter's stack
(Windows/CUDA 13/RTX 4090) has a slower stream-allocator where the
saved cost should translate into measurable tok/s recovery; that
needs verification on the reporter's box.
@dusterbloom dusterbloom marked this pull request as draft April 30, 2026 12:56
@dusterbloom dusterbloom marked this pull request as ready for review April 30, 2026 13:15

@cubic-dev-ai (bot) left a comment


No issues found across 2 files

@dusterbloom dusterbloom force-pushed the fix/issue-55-stable-kv-pad branch from 05cb709 to 0ce6832 Compare May 1, 2026 16:29
@davide221
Contributor

@gtrak can you verify if this is solved?

javierpazo added a commit to javierpazo/lucebox-hub that referenced this pull request May 10, 2026
This change brings concurrent multi-request execution to test_dflash
on a single GPU. It is internally one cohesive unit but can be split
into four conceptual pieces if a smaller review is preferred:

1. Multi TargetCache slots
   - CLI: --target-cache-slots=N (alias --cache-slots=N)
   - prefix `SLOT <id>` routes commands to a specific slot
   - DaemonSlotState + RAII ActiveDaemonSlot for safe switching
   - LIST_TARGET_CACHE_SLOTS for introspection
   - all slots share target/draft weights; only KV/SSM/scratch is
     per-slot
   - create_target_cache gains an `n_seqs` parameter so a single
     cache can be allocated batched up front

2. Tagged stream protocol (opt-in)
   - --stream-tagged emits frames `[-2, request_id, token]` instead
     of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
   - parser recognises `REQ <id>` / `REQUEST <id>` headers
   - legacy bare-int32 streaming is unchanged when the flag is off
   - this lets a client demux multiple concurrent requests over the
     same stdout

3. Native quantum scheduler
   - dispatch table for REQ/SLOT/START, SCHED_STEP, SCHED_DRAIN,
     LIST_REQUESTS
   - cursor-based fair round-robin between admitted requests
   - non-blocking reader thread admits new requests during a drain
   - PendingQuantum{slot, req, epoch, n_gen} carries the unit of work
   - CONTINUE / CONT resumes a slot without re-prefilling
   - REQ <id> CANCEL invalidates a request and bumps the slot epoch
     so a stale CONTINUE is rejected; RESTORE_CHAIN / legacy generate
     refuse to overwrite a slot that is owned by an active scheduler
     request

4. Fused batched target step (CUDA path)
   - new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
     SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
     SCHED_BATCH_DRAIN
   - QwenGraphInputs gains `n_seqs`; build_delta_net_block accepts
     n_seqs > 1
   - target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
     when batched and the chain forwards capture features per-seq
   - batch_probe_compare_ok smoke shows mismatches=0 vs the
     single-seq path; SCHED_BATCH_TARGET_TAIL commits two completed
     pending quanta in 29.26 ms; SCHED_BATCH_TARGET_STEP commits the
     next batched step in 29.57 ms; SCHED_BATCH_DRAIN completes
     req12/req13 with two batched steps each
   - rollback for partially accepted draft tokens, multi-token verify
     and parent-id propagation in the batched path are noted as
     follow-ups; today the batched step accepts the cleanest case
     and falls back to single-seq when needed

Validation (single GPU, GPU1: RTX 6000 Ada sm_89, Heretic Q4_K_M target +
Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV q4_0/q4_0):

- Two concurrent requests:
    REQ 4 START SLOT 0 quantum=2
    REQ 5 START SLOT 1 quantum=2
    SCHED_DRAIN closes both clean.
    slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- Mid-drain admission of REQ 6 succeeds; CONTINUE on slot 0 resumes
  without re-prefill.
- batch_probe_compare_ok mismatches=0 over a 2-seq probe.
- batch_tail_commit count=2 ms=29.26.
- batch_step_commit ms=29.57 followed by SCHED_DRAIN reverts cleanly
  back to the DFlash single-seq path.

Compatibility:
- All new behaviour is opt-in. Default invocation of test_dflash
  with no scheduler flags keeps the legacy single-request path.
- Tagged stream is gated behind --stream-tagged.
- Multi-slot is gated behind --target-cache-slots=N (default N=1).
- Batched target step is reached only via the SCHED_BATCH_* command
  family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are now gated
  behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
  default path is unchanged.

Verification vs existing community PRs:
- No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
  a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
  epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse) and
  PR Luce-Org#62 (split target/draft StepGraphs); both reuse / split graphs
  but neither exposes a multi-request slot protocol.
- No upstream collision found for tagged stream framing or
  --target-cache-slots.

Happy to split this into four sequential PRs (slots / tagged stream /
quantum scheduler / batched target step) if a smaller-grained review
is preferred — let me know.

Author: Javier Pazo <xabicasa@gmail.com>